Task 1

a)

For this task, we are asked to assess whether bookmakers set their over/under odds accurately. I selected five bookmakers for this task:

Pinnacle

Betsafe

Sportingbet

Tipico

William Hill

First, I read the data and calculated the necessary values with the code below.

library(data.table)
library(anytime)
library(plotly)
matches<-data.table(readRDS("df9b1196-e3cf-4cc7-9159-f236fe738215_matches.RDS"))
odds<-data.table(readRDS("df9b1196-e3cf-4cc7-9159-f236fe738215_odd_details.RDS"))
matches=unique(matches)

#Converting the date from Unix
matches[,match_date:=anydate(date)]
matches[,match_time:=anytime(date)]
matches=matches[order(home,-match_time)]
matches[,c("match_date","date"):=NULL]

#Finding Out the Over Games
matches[,c("HomeGoals","AwayGoals"):=tstrsplit(score,':')]
matches$HomeGoals=as.numeric(matches$HomeGoals)
matches$AwayGoals=as.numeric(matches$AwayGoals)
matches[,TotalGoals:=HomeGoals+AwayGoals]
matches[,IsOver:=0]
matches[TotalGoals>2,IsOver:=1]
matches=matches[complete.cases(matches)]

#Finding the Year, Month, Weekday and Hour Information
matches[,Year:=year(match_time)]
matches[,Month:=month(match_time)]
matches[,Weekday:=wday(match_time)]
matches[,Hour:=hour(match_time)]

#Selecting the Over Under Bets with Total Handicap of 2.5
odds_ov_un=odds[betType=='ou' & totalhandicap=='2.5']
odds_ov_un[,totalhandicap:=NULL]

#Finding Out the Initial and Final Odds
odds_ov_un=odds_ov_un[order(matchId, oddtype,bookmaker,date)]

odds_ov_un_initial=odds_ov_un[,list(start_odd=odd[1]),
                              by=list(matchId,oddtype,bookmaker)]

odds_ov_un_final=odds_ov_un[,list(final_odd=odd[.N]),
                            by=list(matchId,oddtype,bookmaker)]

Then I extracted the initial and final odds for my selected bookmakers.

pinnacle_over_under_initial=odds_ov_un_initial[bookmaker=='Pinnacle']
Betsafe_over_under_initial=odds_ov_un_initial[bookmaker=='Betsafe']
Sportingbet_over_under_initial=odds_ov_un_initial[bookmaker=='Sportingbet']
Tipico_over_under_initial=odds_ov_un_initial[bookmaker=='Tipico']
WilliamHill_over_under_initial=odds_ov_un_initial[bookmaker=='William Hill']

pinnacle_over_under_final=odds_ov_un_final[bookmaker=='Pinnacle']
Betsafe_over_under_final=odds_ov_un_final[bookmaker=='Betsafe']
Sportingbet_over_under_final=odds_ov_un_final[bookmaker=='Sportingbet']
Tipico_over_under_final=odds_ov_un_final[bookmaker=='Tipico']
WilliamHill_over_under_final=odds_ov_un_final[bookmaker=='William Hill']


pinnacle_wide_initial=dcast(pinnacle_over_under_initial,
                    matchId~oddtype,
                    value.var='start_odd')
Betsafe_wide_initial=dcast(Betsafe_over_under_initial,
                            matchId~oddtype,
                            value.var='start_odd')
Sportingbet_wide_initial=dcast(Sportingbet_over_under_initial,
                           matchId~oddtype,
                           value.var='start_odd')

Tipico_wide_initial=dcast(Tipico_over_under_initial,
                               matchId~oddtype,
                               value.var='start_odd')


WilliamHill_wide_initial=dcast(WilliamHill_over_under_initial,
                          matchId~oddtype,
                          value.var='start_odd')

pinnacle_wide_final=dcast(pinnacle_over_under_final,
                            matchId~oddtype,
                            value.var='final_odd')

Betsafe_wide_final=dcast(Betsafe_over_under_final,
                           matchId~oddtype,
                           value.var='final_odd')
Sportingbet_wide_final=dcast(Sportingbet_over_under_final,
                               matchId~oddtype,
                               value.var='final_odd')

Tipico_wide_final=dcast(Tipico_over_under_final,
                          matchId~oddtype,
                          value.var='final_odd')


WilliamHill_wide_final=dcast(WilliamHill_over_under_final,
                               matchId~oddtype,
                               value.var='final_odd')

I used probability bins of width 0.1.
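To make the binning concrete, here is a minimal sketch in base R. The odds 1.80 and 2.10 are made-up values chosen for illustration, not taken from the data. The sketch converts one pair of over/under odds into normalized probabilities, using the same normalization as the rest of this report, and shows which bin the over probability falls into.

```r
# Illustrative odds for a single match (assumed values, not from the data set)
over  <- 1.80
under <- 2.10

# Raw implied probabilities sum to more than 1 because of the bookmaker margin
probOver  <- 1 / over
probUnder <- 1 / under
totalProb <- probOver + probUnder

# Normalize so that the two probabilities sum to 1
probOver  <- probOver / totalProb
probUnder <- probUnder / totalProb

# Bins of width 0.1 on [0, 1]; each interval is open on the left, closed on the right
cutpoints <- seq(0, 1, 0.1)
bin <- cut(probOver, cutpoints)

print(round(probOver, 3))  # 0.538
print(as.character(bin))   # "(0.5,0.6]"
```

Because the intervals are open on the left, a probability of exactly 0 would not be assigned to any bin unless include.lowest = TRUE is passed to cut. With real odds this should not occur.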

For Pinnacle,

#Pinnacle

##Initial

merged_matches=merge(matches,pinnacle_wide_initial,by='matchId')

merged_matches[,probOver:=1/over]
merged_matches[,probUnder:=1/under]

merged_matches[,totalProb:=probOver+probUnder]

merged_matches[,probOver:=probOver/totalProb]
merged_matches[,probUnder:=probUnder/totalProb]

merged_matches=merged_matches[complete.cases(merged_matches)]
merged_matches[,totalProb:=NULL]

cutpoints=c(seq(0,1,0.1))
merged_matches[,odd_cut_over:=cut(probOver,cutpoints)]

summary_table=merged_matches[,list(empirical_over=mean(IsOver),
                                   probabilistic_over=mean(probOver),.N),
                             by=list(Year,odd_cut_over)]

summary_table=summary_table[order(Year)]

plot(summary_table[,list(empirical_over,probabilistic_over)],cex=4,main="Initial Probability Data",xlim=c(0,1),ylim=c(0,1))

abline(0,1,col='red')

Based on the plot for the initial odds, most of the points are on the x=y line, meaning that Pinnacle has a good record of estimating probabilities. Also, for low over probabilities, most of the games did end up as under, meaning that the bookmaker didn’t have to pay out to the bettors. I ran essentially the same code for the final odds and for the other bookmakers, so I will move on with the plots instead.
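Since the same steps are repeated for every bookmaker and for both initial and final odds, they could be wrapped in a function. The following is a sketch only: calibration_summary is a hypothetical helper name that does not appear in the original analysis, and it is written in base R on plain data frames so that it runs on its own. It assumes the match table has matchId, Year and IsOver columns and the odds table has matchId, over and under columns. Unlike the original code, it drops only rows where the outcome or the probability is missing, rather than rows with any missing value.

```r
# Hypothetical helper (not part of the original analysis): builds the
# per-year, per-bin calibration table from a match table and a wide odds table.
calibration_summary <- function(matches, wide_odds, width = 0.1) {
  # Work on plain data frames so the base R semantics below apply
  m <- merge(as.data.frame(matches), as.data.frame(wide_odds), by = "matchId")

  # Normalized implied probability of "over" (bookmaker margin removed)
  raw_over  <- 1 / m$over
  raw_under <- 1 / m$under
  m$probOver <- raw_over / (raw_over + raw_under)

  # Keep only rows where both the outcome and the probability are known
  m <- m[!is.na(m$probOver) & !is.na(m$IsOver), ]

  # Assign each match to a probability bin
  m$odd_cut_over <- cut(m$probOver, seq(0, 1, width))

  # One summary row per (Year, bin) combination that actually occurs
  groups <- split(m, list(m$Year, m$odd_cut_over), drop = TRUE)
  out <- do.call(rbind, lapply(groups, function(g) {
    data.frame(
      Year = g$Year[1],
      odd_cut_over = as.character(g$odd_cut_over[1]),
      empirical_over = mean(g$IsOver),
      probabilistic_over = mean(g$probOver),
      N = nrow(g),
      stringsAsFactors = FALSE
    )
  }))
  out <- out[order(out$Year), ]
  rownames(out) <- NULL
  out
}

# Toy inputs with made-up values, only to show the shape of the result
toy_matches <- data.frame(matchId = 1:4,
                          Year = c(2017, 2017, 2018, 2018),
                          IsOver = c(1, 0, 1, 1))
toy_odds <- data.frame(matchId = 1:4,
                       over = c(1.80, 1.80, 1.50, 1.50),
                       under = c(2.10, 2.10, 2.50, 2.50))
toy_summary <- calibration_summary(toy_matches, toy_odds)
print(toy_summary)
```

With the tables built earlier, a call such as calibration_summary(matches, Tipico_wide_final) should give a table comparable to summary_table, provided the wide table has over and under columns.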

Based on the graph for final odds, Pinnacle has actually adjusted their odds and more points now lie on the x=y line. Also, most of the games with low over probabilities did end up as under. So the adjustment was profitable for Pinnacle.

For Betsafe,

Based on the graph, if Betsafe had kept these odds, the company would have lost money. There are some matches where the over probability is high, meaning that betting under pays better, and those matches ended up as under. Let’s see if the company adjusted these odds in the final probabilities graph.

According to the plot, some adjustments were made to the problematic matches. There are still some matches that are away from the empirical data, but overall the adjusted odds are better.

For Sportingbet,

The plot shows that Sportingbet is actually pretty spot on with the initial odds. Most of the points are around the x=y line. Let’s see if the company made some adjustments to the outliers.

With some adjustments, the company’s odds became more accurate. So far, for all the companies, the final odds are better at prediction than the initial odds. Let’s see the last two companies.

For Tipico,

Most of the points are on the x=y line. There are no matches with high over probability that ended up as under. Let’s see if the adjustments made things worse.

This time the adjustment actually made things worse. Now, there are some matches with a high over probability that ended up as under. If the company had stuck with the initial odds, this wouldn’t have happened.

For William Hill,

The plot suggests that William Hill is good at predicting odds. Let’s see the final odds graph.

After the adjustments, some matches with high over probabilities did end up as under. The company was better off without the adjustment.

b)

I decided to continue with William Hill. With the code below, I generated graphs of the probability information over the years.

merged_matches=merge(matches,WilliamHill_wide_initial,by='matchId')

merged_matches[,probOver:=1/over]
merged_matches[,probUnder:=1/under]

merged_matches[,totalProb:=probOver+probUnder]

merged_matches[,probOver:=probOver/totalProb]
merged_matches[,probUnder:=probUnder/totalProb]

merged_matches=merged_matches[complete.cases(merged_matches)]
merged_matches[,totalProb:=NULL]

cutpoints=c(seq(0,1,0.1))
merged_matches[,odd_cut_over:=cut(probOver,cutpoints)]

summary_table=merged_matches[,list(empirical_over=mean(IsOver),
                                   probabilistic_over=mean(probOver),.N),
                             by=list(Year,odd_cut_over)]

summary_table=summary_table[order(Year)]

p1_initial <- plot_ly(summary_table, x = ~Year[odd_cut_over=="(0.4,0.5]"], y = ~empirical_over[odd_cut_over=="(0.4,0.5]"], name = 'Empirical for (0.4,0.5]', type = 'scatter', mode = 'lines')%>%
  add_trace(y = ~probabilistic_over[odd_cut_over=="(0.4,0.5]"], name = 'Probabilistic for (0.4,0.5]', mode = 'lines') %>%
  add_trace(y = ~empirical_over[odd_cut_over=="(0.5,0.6]"], name = 'Empirical for (0.5,0.6]', mode = 'lines') %>%
  add_trace(y = ~probabilistic_over[odd_cut_over=="(0.5,0.6]"], name = 'Probabilistic for (0.5,0.6]', mode = 'lines') %>%
  add_trace(y = ~probabilistic_over[odd_cut_over=="(0.6,0.7]"], name = 'Probabilistic for (0.6,0.7]', mode = 'lines') %>%
  add_trace(y = ~empirical_over[odd_cut_over=="(0.6,0.7]"], name = 'Empirical for (0.6,0.7]', mode = 'lines') %>%
  layout(xaxis = list(title="Years"),
    yaxis = list(title="Probability",range = c(0, 1)))
p1_initial
p2_initial<-plot_ly(summary_table, x = ~Year[odd_cut_over=="(0.3,0.4]"], y = ~empirical_over[odd_cut_over=="(0.3,0.4]"], name = 'Empirical for (0.3,0.4]', type = 'scatter', mode = 'lines')%>%
  add_trace(y = ~probabilistic_over[odd_cut_over=="(0.3,0.4]"], name = 'Probabilistic for (0.3,0.4]', mode = 'lines') %>%
  layout(xaxis = list(title="Years"),
         yaxis = list(title="Probability",range = c(0, 1)))

p2_initial
p3_initial<-plot_ly(summary_table, x = ~Year[odd_cut_over=="(0.7,0.8]"], y = ~empirical_over[odd_cut_over=="(0.7,0.8]"], name = 'Empirical for (0.7,0.8]', type = 'scatter', mode = 'lines')%>%
  add_trace(y = ~probabilistic_over[odd_cut_over=="(0.7,0.8]"], name = 'Probabilistic for (0.7,0.8]', mode = 'lines') %>%
  layout(xaxis = list(title="Years"),
         yaxis = list(title="Probability",range = c(0, 1)))
p3_initial

Based on these plots, the initial probabilities follow the empirical data closely for the bins (0.4,0.5], (0.5,0.6] and (0.6,0.7]. This means that the company was better at predicting the probabilities for matches where the chances of over and under are close to 50/50.

For final odds,

Again, it can be seen that the company is better at predicting the probabilities in the middle bins. The final probabilities also show that matches with low over probabilities did end up as under, so we can say that the company is good at predicting the under results as well.

An alternative visualization would be to draw, instead of two lines per bin, the difference between the empirical and probabilistic values. This way the plot would be less messy. If that difference is around zero, the bookmaker is doing a nice job of predicting the odds. To demonstrate, I will use the final odds.

summary_table_dif=summary_table[,list(difference=empirical_over-probabilistic_over),by=list(Year,odd_cut_over)]

p_Change <- plot_ly(summary_table_dif, x = ~Year[odd_cut_over=="(0.4,0.5]"], y = ~difference[odd_cut_over=="(0.4,0.5]"], name = 'Difference for (0.4,0.5]', type = 'scatter', mode = 'lines')%>%
  add_trace(y = ~difference[odd_cut_over=="(0.5,0.6]"], name = 'Difference for (0.5,0.6]', mode = 'lines') %>%
  add_trace(y = ~difference[odd_cut_over=="(0.6,0.7]"], name = 'Difference for (0.6,0.7]', mode = 'lines') %>%
  add_trace(y = ~difference[odd_cut_over=="(0.3,0.4]"], name = 'Difference for (0.3,0.4]', mode = 'lines') %>%
  layout(xaxis = list(title="Years"),
         yaxis = list(title="Probability Difference",range = c(-1, 1)))
p_Change
p_Change1 <- plot_ly(summary_table_dif, x = ~Year[odd_cut_over=="(0.7,0.8]"], y = ~difference[odd_cut_over=="(0.7,0.8]"], name = 'Difference for (0.7,0.8]', type = 'scatter', mode = 'lines')%>%
layout(xaxis = list(title="Years"),
         yaxis = list(title="Probability Difference",range = c(-1, 1)))
p_Change1

The plots suggest the same thing I discussed above, with a cleaner visualization.
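The idea that a difference "around zero" indicates good prediction could also be condensed into one number per year. The sketch below is an assumption on my part and not something computed in the original analysis: it takes a made-up table with the same columns as summary_table and computes the mean absolute difference per year, weighting each bin by its number of matches N so that sparsely populated bins count less.

```r
# Toy table with the same columns as summary_table (all values are made up)
toy <- data.frame(
  Year = c(2017, 2017, 2018, 2018),
  empirical_over = c(0.50, 0.70, 0.45, 0.60),
  probabilistic_over = c(0.45, 0.65, 0.55, 0.65),
  N = c(100, 50, 80, 20)
)

# Absolute gap between empirical and probabilistic values for each bin
toy$abs_gap <- abs(toy$empirical_over - toy$probabilistic_over)

# Count-weighted mean absolute gap per year: bins with more matches weigh more
gap_by_year <- sapply(split(toy, toy$Year),
                      function(g) weighted.mean(g$abs_gap, g$N))
print(gap_by_year)
```

A value closer to zero would indicate that, in that year, the probabilities implied by the odds were closer to the observed frequencies.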

Task 2

For this task, I selected the 12BET betting company. In the code below, I calculated the percent change in each of the odds (home win, away win and draw). I gathered all this information in a single data table called “merged_matches_change”.

matches<-data.table(readRDS("df9b1196-e3cf-4cc7-9159-f236fe738215_matches.RDS"))
odds<-data.table(readRDS("df9b1196-e3cf-4cc7-9159-f236fe738215_odd_details.RDS"))
matches=unique(matches)
matches[,c("HomeGoals","AwayGoals"):=tstrsplit(score,':')]
matches$HomeGoals=as.numeric(matches$HomeGoals)
matches$AwayGoals=as.numeric(matches$AwayGoals)

# "x" represents the games that ended in a draw, "1" is for a home win and "2" is for an away win
matches[,homewin:="x"]
#Finding out about who won

#homewin is a character column, so the win labels are assigned as strings
matches[HomeGoals>AwayGoals,homewin:="1"]
matches[HomeGoals<AwayGoals,homewin:="2"]
matches=matches[complete.cases(matches)]

odds_1x2=odds[betType=='1x2' & bookmaker=='12BET']

odds_1x2=odds_1x2[order(matchId, oddtype,bookmaker,date)]
odds_1x2[,totalhandicap:=NULL]
#Calculating the Odd Changes
odds_1x2_change=odds_1x2[,list(odd_change=((odd[.N]-odd[1])/odd[1])*100),
                          by=list(matchId,oddtype,bookmaker)]


odds_1x2_wide_change=dcast(odds_1x2_change,
                               matchId~oddtype,
                               value.var='odd_change')
merged_matches_change=merge(matches,odds_1x2_wide_change,by='matchId')
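As a quick sanity check of the percent-change formula used above, here is a minimal sketch in base R with a made-up odds history (the values are illustrative, not from the data). It also shows that a match with only one recorded odd gets a change of zero, since the first and last values coincide.

```r
# Made-up odds history for one match and one odd type, ordered by time
odd <- c(2.00, 2.10, 2.20)

# Percent change from the first to the last recorded odd
odd_change <- (odd[length(odd)] - odd[1]) / odd[1] * 100
print(odd_change)  # 10, up to floating-point rounding

# With a single recorded odd, first and last are the same value
single <- c(1.50)
single_change <- (single[length(single)] - single[1]) / single[1] * 100
print(single_change)  # 0
```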

To visualize the relation between the odd changes and the match results, I used three box plots, one for each odd change. The box plot below shows the relation between the home win odd change and the results.

pchange1<-plot_ly(merged_matches_change, x = ~homewin, y = ~odd1, type='box') %>%
  layout(title='Home Win Odd Change',
         yaxis = list(title='Home Win Odd Change ',zeroline = FALSE),
         xaxis = list(title='Results',zeroline = FALSE))
pchange1

According to the box plot, for away wins the home win odd is increased most of the time, with more than 75% of the changes above zero. This is sensible because, as the game time approaches, more information becomes available: weather conditions, player conditions, team formations, etc. This change information can be used for predicting match results. The other two box plots, for away win and draw, are given below.

This box plot is again sensible. For home wins, away win odds are increased until the match time. Draw odds are not increased as much as away win odds.

An interesting takeaway from this plot is that bookmakers usually don’t make big changes to the draw odds. This may be because predicting a draw is harder than predicting a loss or a win, so they play it safe with draw odds.
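The reading "more than 75% of the changes are above zero" for away wins can also be checked numerically rather than by eye. The sketch below uses made-up percent changes (not the real data) with the same column names as merged_matches_change, homewin and odd1, and computes the first quartile of the home win odd change for each result. A positive first quartile means that roughly 75% or more of the changes in that group are above zero.

```r
# Made-up percent changes in the home win odd, grouped by match result
# ("1" = home win, "2" = away win); values are illustrative only
toy <- data.frame(
  homewin = c("1", "1", "1", "1", "2", "2", "2", "2"),
  odd1    = c(-6, -3, -1, 2, 1, 3, 5, 8)
)

# First quartile of the change for each result
q1 <- tapply(toy$odd1, toy$homewin, quantile, probs = 0.25)
print(q1)
```

On the real data, the analogous call would be tapply(merged_matches_change$odd1, merged_matches_change$homewin, quantile, probs = 0.25, na.rm = TRUE), assuming the home win column is named odd1 as in the box plot code.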

Interesting R Examples
  1. Exploring Survival on the Titanic

    This example is important because it is an R Notebook on the Titanic data set. Titanic is considered one of the Data Science 101 data sets, and this notebook provides a good exploratory data analysis. It is also a Kaggle Kernel, which is a very good source for this kind of example R code. The notebook also covers feature engineering, missing data imputation and modeling.

  2. New York City Taxi Fare Prediction

    In this example, competitors were tasked with predicting the taxi fare given the pick-up and drop-off locations. It is important because it deals with a large dataset. The competitors were also tasked with making better predictions than the current models.

  3. Google Landmark Recognition Challenge

    In this example, competitors tried to recognize famous landmarks in a given huge dataset of pictures. This example combines image recognition with big datasets.

  4. Kobe Bryant Shot Selection

    This example has a large dataset consisting of all the shots Kobe Bryant took during his career. The users were tasked with predicting which of these shots went in.

  5. House Prices: Advanced Regression Techniques

    This is a great example for a beginner in data science. The competitors are tasked with predicting a house’s price given the house’s attributes.